Section 1.1 - Understanding Distributed Systems and how Spark Works

Working with Spark requires a different kind of thinking. Your code isn't executing in a sequential manner anymore; it's being executed in parallel. To write performant parallel code, you will need to think about how you can perform as many tasks at the same time as possible while minimizing how much they block one another. By understanding how Spark and distributed systems work, you can get into the right mindset and write clean, parallel Spark code.

Understanding Distributed Systems/Computing

Sequential Applications

In a generic application the code path runs in sequential order: the code executes from the top of the file to the bottom, line by line.
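
As a tiny illustration (plain Python, nothing Spark-specific, with placeholder values), each statement below only starts once the one before it has finished:

# each line runs only after the previous line has completed
numbers = [1, 2, 3]
doubled = [n * 2 for n in numbers]
total = sum(doubled)
print(total)  # 12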

Multiprocess/threaded Applications

In a multi-processed/threaded application the code path will diverge. The application assigns portions of the code to threads/processes, which handle these tasks in an asynchronous manner. Once these tasks are completed, the threads/processes signal the main application and the code path converges again.

These threads/processes are allocated a certain amount of memory and processing power on a single machine where the main application is running.

Think about how your computer can handle multiple applications at once. This is multiple processes running on a single machine.
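
Here is a rough Python sketch of that idea (the function and values are just placeholders). The main application hands pieces of work to a pool of processes and the code path converges again once they all finish:

from concurrent.futures import ProcessPoolExecutor

def square(n):
    return n * n

if __name__ == "__main__":
    # the main application assigns portions of the work to separate processes
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(square, [1, 2, 3, 4]))  # each call may run on a different core
    # once every process is done, the code path converges back into the main application
    print(results)  # [1, 4, 9, 16]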

Distributed Computing (Clusters and Nodes)

Spark is a distributed computing framework that can be used from Python, Scala, Java, or R. When we say "distributed computing" we mean that our application runs on multiple machines/computers called Nodes, which are all part of a single Cluster.

This is very similar to how a multi-processed application would work, just with more processing juice. Each Node is essentially a computer running multiple processes at once.

Master/Slave Architecture

From the image in the last section we can see there are 2 types of Nodes: a single Driver/Master Node and multiple Worker Nodes.

Each machine is assigned a portion of the overall work in the application. Through communication and message passing, the machines attempt to compute each portion of the work in parallel. When a portion of work depends on another portion, it has to wait until that work is computed and passed to the Worker. When every portion of the work is done, it is all sent back to the Master Node. The Master Node coordinates this communication and message passing by talking to the Worker Nodes.

You can think of it as a team lead with multiple engineers, UX researchers, designers, etc. The lead assigns tasks to everyone. The designers and UX researchers collaborate to create an effective, user-friendly interface; once they are done, they report back to the lead. The lead passes this information to the engineers, who can then start coding the interface, and so on.

Lazy Execution

When you're writing a Spark application, no work is actually done until you perform an action. As seen in some examples in the previous chapters, an action is when you want to see the results of your Spark transformations, for example via collect(), toPandas(), or show(). This triggers the Driver to start distributing the work.
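
As a minimal PySpark sketch (the DataFrame and the column name "value" are just for illustration), the first two statements below only build up a plan; nothing is computed until show() is called:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000)                          # lazy: no work is done yet
doubled = df.withColumn("value", df["id"] * 2)  # still lazy: Spark only records the transformation
doubled.show(5)                                 # the action triggers the Driver to distribute the work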

For now this is all you need to know; we will look into why Spark works this way and why it's a desirable design decision.

MapReduce

When the Driver Node actually starts to do some work, it communicates and distributes the work using a technique called "MapReduce". There are two essential behaviors in a MapReduce application: map and reduce.

Map

When the Driver Node is distributing the work, it maps 1) the data and 2) the transformations to each Worker Node. This allows the Worker Nodes to perform the work (the transformations) on the associated data in parallel.

Ex.

# initialize your variables
x = 5
y = 10

# do some transformations to your variables
x = x * 2
y = y + 2

Here we can see that the arithmetic operations performed on x and y can be done independently, so we can do those two operations in parallel on two different Worker Nodes: Spark will map x and the operation x * 2 to one Worker Node, and y and the operation y + 2 to another Worker Node.

Think about this on a larger scale: we have 1 billion rows of numbers that we want to increment by 1. Spark will map portions of the data to each Worker Node, along with the + 1 operation.
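
A minimal PySpark sketch of this idea, using a small parallelized list as a stand-in for the 1 billion rows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# a small stand-in for the 1 billion rows; Spark splits it into partitions across the Worker Nodes
numbers = spark.sparkContext.parallelize(range(1, 11))

# the "+ 1" operation is mapped to every Worker Node holding a piece of the data
incremented = numbers.map(lambda n: n + 1)

print(incremented.collect())  # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]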

Reduce

When work can't be done in parallel because it depends on some previous work, the transformed data is sent back to the Driver Node from all the Worker Nodes. From there, the new data may be redistributed to the Worker Nodes to resume execution, or the remaining execution happens on the Driver Node, depending on the type of work.

Ex.

# initialize your variables
x = 5
y = 10

# do some transformations to your variables
x = x * 2
y = y + 2

# do some more transformations
z = x + y

Similar to the example above, but here we see that the last transformation z = x + y depends on the previous transformations. So we need to collect all the work done on the Worker Nodes back to the Driver Node and perform the final transformation there.
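
A minimal PySpark sketch of the same shape (the values and partitioning are just for illustration): the independent map steps can run on different Worker Nodes, and the dependent combination happens in the reduce step:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# put x and y on separate partitions so their transformations can run on different Worker Nodes
parts = spark.sparkContext.parallelize([5, 10], numSlices=2)

# the independent "map" step: 5 -> 5 * 2 = 10, 10 -> 10 + 2 = 12
transformed = parts.map(lambda n: n * 2 if n == 5 else n + 2)

# the dependent "reduce" step combines the partial results: 10 + 12 = 22
z = transformed.reduce(lambda a, b: a + b)
print(z)  # 22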

Key Terms

Driver Node

Described in the Master/Slave Architecture section above.

Worker Node

Described in the Master/Slave Architecture section above.

Executor

A process launched from an application on a Worker Node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.

Jobs

A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action.

Stages

A smaller set of tasks inside a job. Each job gets divided into stages that depend on each other, similar to the map and reduce steps above.

Tasks

A unit of work that will be sent to one executor.
